These notes are based on Chapter 2.7 & 7.4-7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.

Introduction

To use Monte Carlo methods, we first specify random variables and then sample from them. In the examples of Monte Carlo integration, we used only the uniform and exponential distributions, so sampling was not a problem; we could simply call R's built-in samplers (runif or rexp).

This is not always the case, however. For example, in Bayesian statistics, after computing the posterior PDF using Bayes' formula, we are typically interested in its mean. But, as the products and division involved in the formula suggest, posterior PDFs easily become unwieldy. We would therefore like to use MC integration, but how can we sample from unwieldy PDFs that resemble no common distribution?

In these notes, we cover three sampling methods in ascending order of sophistication:

  • Inverse transform method
  • Acceptance-rejection method
  • Markov chain Monte Carlo (MCMC)

Inverse transform method

The method is applicable only when we can invert (part of) the CDF of \(X\). When applicable, it is fast and reliable.

Invertible \(F\)

To illustrate the idea, consider the case where \(F\), the CDF of \(X\), is invertible (e.g., \(X\sim N(0,1)\)). Given \(U\sim U(0,1)\), the random variable \(Y=F^{-1}(U)\) has the same CDF as \(X\). That is, given a sample \(u\) of \(U\), \(y=F^{-1}(u)\) is a sample of \(X\). Here is why:

\[\begin{align} P(Y\leq y) &= P(F^{-1}(U)\leq y)\\ &= P(U\leq F(y))\\ &= F(y) \end{align}\]

par(mar=c(2,.5,1,.5))
f <- function(x) 1/(1+exp(-(x-2)))
curve(1/(1+exp(-(x-2))), from=-2, to=7, axes=FALSE, ann=FALSE)
axis(side=1, at=c(-4,0,2.5,8), labels=c("","0","X",""))
abline(v=0)
lines(c(0,7), c(1,1), lty=3)
arrows(0,f(2.5),2.5,f(2.5),length=.1,angle=12)
arrows(2.5,f(2.5),2.5,-.01,length=.1,angle=12)
text(0,f(2.5), "U", pos=2)
text(0,1, "1", pos=2)

Note that the definition \(Y=F^{-1}(U)\) implies \(U=F(Y)\). Since \(X\) and \(Y\) have the same distribution and \(X\) is an arbitrary continuous random variable, we have \(U=F(X)\): if we transform an arbitrary continuous \(X\) by its own CDF, we obtain a \(U(0,1)\) random variable (the probability integral transform).
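Both directions can be checked quickly with R's built-in quantile and distribution functions for the normal, qnorm and pnorm; a minimal sketch (the printed values are only approximate):

```r
set.seed(1)
u <- runif(1e5)          # samples of U(0,1)
y <- qnorm(u)            # F^{-1}(U): should behave like N(0,1)
v <- pnorm(rnorm(1e5))   # F(X): should behave like U(0,1)

c(mean(y), sd(y))        # approximately 0 and 1
c(mean(v), var(v))       # approximately 1/2 and 1/12
```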

Almost invertible \(F\)

The CDFs of many, if not most, random variables are technically not invertible. Remember that a random variable \(X\) is a function whose codomain is the entire real line \(\mathbb{R}\). Thus, to be invertible, \(F\) must be strictly increasing over \(\mathbb{R}\); equivalently, the support of \(X\) must be \(\mathbb{R}\). This condition excludes many random variables, including the exponential and gamma, whose support is only the half line \([0,\infty)\).

However, the CDF of \(X\) with support \([0,\infty)\) is almost invertible: if we define \(G\) to be \(F\) restricted to the domain \([0,\infty)\) instead of \((-\infty,\infty)\), then \(G\) is invertible and we can generate a sample \(G^{-1}(u)\) of \(X\).

\(F\) for discrete \(X\)

When \(X\) is discrete, \(F\) is a step function.

To deal with a more general case, we may use the following \(H^{-1}:(0,1)\to\mathbb{R}\) to generate a sample \(H^{-1}(u)\) of \(X\): \[ H^{-1}: u \mapsto \inf\{x : F(x)\geq u\} .\]

Notice that \(H^{-1}\) is a generalisation of \(F^{-1}\) or \(G^{-1}\); \(H^{-1}\) still works in those special cases.

In practice, when \(X\) has finite support, we can use the R built-in function sample(x, size, replace=TRUE, prob=...), passing the PMF as prob.
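For instance, for a discrete \(X\) with an arbitrary illustrative PMF \((0.2, 0.5, 0.3)\) on \(\{1,2,3\}\), the generalised inverse \(H^{-1}\) amounts to locating \(u\) among the cumulative probabilities, which findInterval does; sample draws from the same PMF directly:

```r
set.seed(1)
p <- c(.2, .5, .3)                # an arbitrary PMF on x = 1, 2, 3 (for illustration)
u <- runif(1e5)
# H^{-1}(u) = inf{x : F(x) >= u}, with F given by the cumulative sums of p
x1 <- findInterval(u, cumsum(p), left.open=TRUE) + 1
# the built-in shortcut draws from the same PMF directly
x2 <- sample(1:3, 1e5, replace=TRUE, prob=p)

table(x1) / 1e5                   # relative frequencies close to (0.2, 0.5, 0.3)
```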

Example

Let’s generate samples from \(\text{exp}(1)\) whose CDF is \(F(x)=1-e^{-x}\). Solving the following \[\begin{alignat}{2} && u &= 1-e^{-x}\\ &\Leftrightarrow\quad & x &= -\log(1-u)\\ \end{alignat}\] we find \(G^{-1}(u) = -\log(1-u)\).

Below is a histogram together with the actual PDF.

n <- 1e6
u <- runif(n)
x_exp <- -log(u) # U and 1-U have the same distribution
x <- x_exp[x_exp<6]
hist(x, seq(0,6,.1), freq=FALSE, ylim=c(0,1), ylab="f(x)", main="PDF of exp(1)")
curve(exp(-x), from=0, lwd=2, add=TRUE)

Acceptance-rejection method

Even when the inverse of \(F\) exists, inverting \(F\) is often analytically impossible or numerically too expensive. In these cases, we may try the acceptance-rejection method.

The acceptance-rejection method is based on the following intuition.

Obtain a random sample \((x,y)\) by uniform sampling from the region bounded by the \(x\)-axis and the density curve of \(X\). Then keep \(x\) as a sample from \(X\).

The beauty of this method is that we need not directly sample from the target PDF, but only uniformly sample from the region under the graph of the target PDF.

Recall that, by uniform sampling from a set \(A\subset\mathbb{R}^d\), we mean a random variable (\(d=1\)) or random vector (\(d>1\)) \(X\) with support \(A\) and (joint) PDF \(f(x) = \mathbb{1}_A(x)/|A|\).

How can we obtain random samples uniformly taken from that region? A trick is:

  1. Find another PDF \(g(x)\) from which we can sample.
  2. Find a constant \(C>0\) such that \(Cg(x)\geq f(x)\) for all \(x\in\mathbb{R}\).
  3. Uniformly sample from the region under the curve \(Cg(x)\).

By repeating step 3 and keeping only the samples that fall under the curve \(f(x)\), we obtain samples \(\{(x,y):0<y\leq f(x)\}\) uniformly taken from the desired region.

The claim is intuitive but may not be obvious. So, two key facts are stated as lemmas below.
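The steps above can be sketched as a generic rejection sampler. The target here is illustrative and not used elsewhere in these notes: \(f\) is the Beta(2,2) PDF \(6x(1-x)\) on \([0,1]\), the envelope \(g\) is \(U(0,1)\), and \(C=1.5\) (the maximum of \(f/g\)):

```r
set.seed(1)
f  <- function(x) 6*x*(1-x)       # illustrative target: Beta(2,2) PDF
rg <- function(n) runif(n)        # sampler for the envelope g = U(0,1)
C  <- 1.5                         # max of f(x)/g(x), attained at x = 1/2

accept_reject <- function(n) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rg(n)                    # step 1: propose from g
    y <- runif(n, 0, C)           # step 3: uniform height under C*g(x) = C
    out <- c(out, x[y <= f(x)])   # keep only points under f
  }
  out[1:n]
}

x <- accept_reject(1e5)
c(mean(x), var(x))  # Beta(2,2) has mean 1/2 and variance 1/20
```

The expected acceptance rate is \(1/C = 2/3\), so a loose envelope (large \(C\)) directly wastes proposals.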

Example

Let’s generate samples from \(X\sim N(0,1)\). There are two observations to make.

First, since the PDF is symmetric around 0, once we generate positive samples, we can randomly flip their signs to obtain negative samples. The PDF of the positive half is obtained by re-normalising the original PDF: \[ f(x) = \begin{cases} \sqrt{\frac{2}{\pi}}e^{-x^2/2} & x\geq0\\ 0 & \text{otherwise} \end{cases} . \]

Second, we can bound the above \(f\) by \(Cg(x)\) where \(g\) is the PDF for \(\text{exp}(1)\) and \(C=\sqrt{2e/\pi}\).

Notice \(C = \max_x \frac{f(x)}{g(x)}\).

curve(sqrt(2*exp(1)/pi)*exp(-x), from=0, to=4, ylab="f(x)", lty=3)
curve(sqrt(2/pi)*exp(-x^2/2), from=0, to=4, add=TRUE)
title(bquote(f(x) ~ "bounded by" ~ Ce^-x))
legend(2.9, 1.2, lty=c(1,3),
       legend=c("Half normal", expression(Ce^-x)))
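The value of \(C\) can also be confirmed numerically; optimize maximises \(f(x)/g(x)\) over an interval (here \([0,10]\), assuming the maximiser lies inside):

```r
ratio <- function(x) sqrt(2/pi) * exp(-x^2/2) / exp(-x)  # f(x)/g(x)
opt <- optimize(ratio, c(0, 10), maximum=TRUE)
opt$maximum     # the maximiser, x = 1
opt$objective   # C = sqrt(2e/pi), about 1.3155
```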

Here is a result. Note that we reuse the samples from \(\text{exp}(1)\) generated by the inverse transform method above.

half.normal <- function(x) sqrt(2/pi)*exp(-x^2/2)
C <- sqrt(2*exp(1)/pi)
y <- runif(length(x_exp))*C*exp(-x_exp)
x_pos <- x_exp[y <= half.normal(x_exp)]
ind <- 1:floor(length(x_pos)/2)
x_neg <- -x_pos[ind]
x_pos <- x_pos[-ind]
x <- c(x_neg[x_neg>-4], x_pos[x_pos<4])
res <- hist(x, 100, freq=FALSE, ylab="f(x)", xlim=c(-4,4), main="PDF of N(0,1)")
curve(dnorm(x), lwd=2, add=TRUE) # overlay the actual N(0,1) PDF

Lemma 1

A uniform sample \((x,y)\) from the region under the curve \(Cg(x)\) can be obtained by first sampling \(x\) from the PDF \(g(x)\) and then sampling \(y\) from \(\text{U}(0,Cg(x))\) given \(x\).

Let \(h\) be the conditional PDF for \(\text{U}(0,Cg(x))\) given \(x\): \[ h(y|x) = \begin{cases} \frac{1}{Cg(x)} & y\in[0,Cg(x)]\\ 0 & \text{otherwise} \end{cases} . \] Then, observe the following joint PDF of \((x,y)\): \[\begin{align} f(x,y) &= h(y|x)g(x)\\ &= \frac{1}{Cg(x)}g(x)\\ &= \frac{1}{C}, \end{align}\] which implies the uniform (joint) distribution.

Lemma 2

Let \(A,B\subset\mathbb{R}^2\) with \(A\subseteq B\). If \(\{x_n\} = x_1,x_2,\dots\) is an IID sequence obtained by uniform sampling from \(B\), then the subsequence \(\{x_{n_k}\} = x_{n_1},x_{n_2},\dots\) of points with \(x_{n_k}\in A\) is an IID sequence obtained by uniform sampling from \(A\).

When you watch the animation in Estimating \(\pi\), don’t you think dots look uniformly distributed inside the quarter disk?

First, consider the probability that \(x\) falls inside a small region \(dA\) inside \(A\). \[\begin{align} P(x\in dA |x\in A) &= \frac{P(x\in(dA\cap A))}{P(x\in A)}\\ &= \frac{P(x\in dA)}{P(x\in A)}\\ &= \frac{\int_{dA}fdx}{\int_{A}fdx} \end{align}\]

Then, dividing by the area \(|dA|\) and letting \(dA\) shrink to the point \(x\), the conditional density is: \[\begin{align} f(x|x\in A) &= \lim_{|dA|\to 0}\frac{P(x\in dA | x\in A)}{|dA|}\\ &= \lim_{|dA|\to 0}\frac{\int_{dA}f\,dx}{|dA|\int_{A}f\,dx}\\ &= \frac{f(x)}{\int_{A}f\,dx} \end{align}\]

An implication is that, if \(x\) is uniformly distributed over \(B\), \[\begin{align} f(x|x\in A) &= \frac{\frac{1}{|B|}}{\frac{|A|}{|B|}}\\ &= \frac{1}{|A|}\\ \end{align}\]

which is a constant and integrates to \(1\) over \(A\).

Hence, a subset of uniform samples from \(B\) that also belong to \(A\) are indeed uniformly distributed over \(A\).

Markov chain Monte Carlo

In principle, the acceptance-rejection method can work in high dimensions if we can find a PDF \(g(x)\) from which we can efficiently sample and \(C\) for a tight envelope \(Cg(x)\). It turns out this is a big IF.

In the 1-D case, by looking at the graph of the target \(f(x)\), we can often design a reasonable envelope \(Cg(x)\). Even if the envelope is not very tight, we can wait a while and still collect enough samples from \(U(0,Cg(x))\) that fall below \(f(x)\).

This is unlikely to be the case in high dimensions. Our visual intuition in a 4-D or higher space is poor (at least mine is). In addition, the number of samples needed grows exponentially with the dimension, and we may not be able to wait for hours or days.

Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate samples from an approximate distribution that increasingly resembles a target distribution. A sequence of samples is generated by running a Markov chain designed to have a limiting distribution equal to the target distribution. By being content with approximate distributions, MCMC lets us handle functions on high-dimensional spaces, which are common in modern complex problems.

Since Monte Carlo simply refers to stochastic simulation, we might as well call it Markov chain simulation. But MCMC has become the standard name that everybody in statistical computing understands, perhaps because the acronym is catchy.

Markov chain

To present practical tools in a small space, we gloss over much of the Markov chain theory, which is quite rich and fascinating!

Definition and notations

A Markov chain is a sequence of random variables \(X_1,X_2\dots\) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]

In an IID sequence (e.g., Bernoulli trials), futures are independent even of the present.

For MCMC applications, we are interested in continuous random variables \(\{X_i\}\). In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]

In particular, we restrict ourselves to Markov chains for which the conditional PDFs \(f_{X_{i+1}|X_i}(y|x)\) do not depend on \(i\) and denote it by \(q\). That is, \[q(y|x) = f_{X_{i+1}|X_i}(y|x)=f_{X_1|X_0}(y|x)\] for all \(i\geq0\). We call \(q\) the transition density of the Markov chain.

A probability distribution is called stationary if its PDF \(f\) satisfies the following global balance equations: \[f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i} .\]

You may see \(f(x_{i})q(x_{i+1}|x_{i})\) as the joint PDF of \(X_i\) and \(X_{i+1}\). Integrating out \(X_i\) gives the marginal distribution of \(X_{i+1}\) as usual. Under this special dependency, however, the marginal of \(X_{i+1}\) turns out to be the same as that of \(X_i\).
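For a finite-state chain, the global balance equations reduce to \(f = fP\), i.e., \(f\) is a left eigenvector of the transition matrix \(P\) with eigenvalue 1, and the integral becomes a sum. A small numerical illustration with an arbitrary two-state matrix (chosen only for this sketch):

```r
# An arbitrary 2-state transition matrix (rows sum to 1), for illustration
P <- matrix(c(.9, .1,
              .4, .6), 2, 2, byrow=TRUE)
# The stationary PMF is a left eigenvector of P with eigenvalue 1
e <- eigen(t(P))
i <- which.min(abs(e$values - 1))
f <- Re(e$vectors[, i]); f <- f / sum(f)
f          # stationary PMF, here (0.8, 0.2)
f %*% P    # global balance: equals f again
```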

Why Markov chain?

Given the initial value \(x_0\), if we can draw samples from \(q\), then we can simulate a Markov chain and sequentially obtain samples \(x_0,x_1,\dots,x_n\).

The rationale for Markov chain as a method for sampling from our desired distribution \(f\) is:

  1. Certain Markov chains converge to unique stationary distributions as \(n\to\infty\).
  2. We can design a Markov chain whose limiting distribution has the target \(f\).
  3. After running the Markov chain long enough \((i\geq I)\), we may expect the distribution to be almost stationary and the samples \(x_I,x_{I+1},\dots,x_n\) with \(n\gg I\) to resemble IID samples from \(f\).

Samples from the first \(I\) steps (so-called “burn-in” period) are often discarded.

Hence, our task in MCMC is to design a good Markov chain.

Design principles

Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.

Irreducibility ensures that every value \(x\) can be reached from every other value within finitely many steps \((i<\infty)\). This matters because, in MCMC, samples are generated by dependent sampling according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\). Irreducibility guarantees that such \(x\) will eventually be reached.

Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.

See Roberts & Rosenthal (2004) for more details.

Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations to ensure that the chain converges to our target \(f\). In practice, however, this is tricky and challenging. So, we typically impose stronger structure called the detailed balance equations. \[f(y)q(x|y) = f(x)q(y|x), \;\forall x,y.\]

By integrating both sides over \(x\), we see the detailed balance equations imply the global balance equations.
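In the finite-state case the detailed balance equations read \(f_x P_{xy} = f_y P_{yx}\) for all states \(x,y\); a chain satisfying them automatically satisfies global balance. A quick check with an illustrative three-state chain (not from the text):

```r
# A reversible 3-state chain: moves only between neighbouring states
P <- matrix(c(.5,  .5,  0,
              .25, .5, .25,
              0,   .5, .5), 3, 3, byrow=TRUE)
f <- c(1, 2, 1) / 4       # candidate stationary PMF
D <- diag(f) %*% P        # D[x, y] = f(x) q(y|x)
all.equal(D, t(D))        # TRUE: detailed balance holds
f %*% P                   # hence global balance: equals f again
```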

Demo

Here is an example of MCMC for sampling from a bivariate normal distribution \(N(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{pmatrix}1 & 0.8\\0.8 & 1\\\end{pmatrix} .\]

In practice, of course, nobody should use MCMC to sample from a normal distribution. Instead, use the mvrnorm function from the MASS package.

Remember that the joint PDF looks like this:

library(akima)  # for interp
library(plotly) # for plot_ly

r <- 0.8
f <- function(x,y) {
  return( 1/(2*pi*sqrt(1-r^2))*exp(-1/(2*(1-r^2))*(x^2+y^2-2*r*x*y)) )
}
X <- seq(-3, 3, length=100)
Y <- seq(-3, 3, length=100)
XY <- expand.grid(X=X, Y=Y) # create a grid of (x, y) points
Z <- f(XY$X, XY$Y)
s <- interp(x=XY$X, y=XY$Y, z=Z) # interpolate onto a regular grid
plot_ly(x=s$x, y=s$y, z=t(s$z)) %>% add_surface()

Using Gibbs sampler, a result may look like the following.

n <- 300
m <- 5
x <- rep(0,n)
y <- rep(0,n)
x[1] <- 5*runif(1)-2.5
y[1] <- 5*runif(1)-2.5
for (i in 2:n) {
  x[i] <- rnorm(1, mean=r*y[i-1], sd=sqrt(1-r^2))
  y[i] <- rnorm(1, mean=r*x[i], sd=sqrt(1-r^2))
  par(mar=c(5,5,.1,.1))
  plot(x[1], y[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
  if (i <= m) {
    segments(x[1:(i-1)], y[1:(i-1)], x[2:i], y[2:i], lwd=.5) # connect the points visited so far
  } else {
    points(x[2:(i-(m-1))], y[2:(i-(m-1))], pch=20, cex=.3)
    segments(x[(i-m):(i-2)], y[(i-m):(i-2)],
             x[(i-(m-1)):(i-1)], y[(i-(m-1)):(i-1)], lwd=.5)
  }
  arrows(x[i-1],y[i-1],x[i],y[i], length=0.08, angle=20, lwd=1)
}

Note the dependence of each new point on the previous one; each arrow travels only so far. Collectively, however, the points are distributed as if sampled from the target distribution.
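With the animation stripped away and a longer run, the same Gibbs sampler recovers the target moments; a sketch reusing \(r=0.8\) from above (the burn-in length is an arbitrary choice):

```r
set.seed(1)
r <- 0.8
n <- 1e5
x <- y <- numeric(n)      # start the chain at (0, 0)
for (i in 2:n) {
  # full conditionals of the bivariate normal: X|Y=y ~ N(r*y, 1-r^2), and symmetrically
  x[i] <- rnorm(1, mean=r*y[i-1], sd=sqrt(1-r^2))
  y[i] <- rnorm(1, mean=r*x[i],   sd=sqrt(1-r^2))
}
burnin <- 1000
cor(x[-(1:burnin)], y[-(1:burnin)])  # close to 0.8
```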

Metropolis-Hastings algorithm

In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven designs. The Metropolis-Hastings algorithm provides a good template.

(to be written)

Gibbs sampler

(to be written)

References

  • Sec 11.2. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  • Ch 11. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC press.
  • Sec 6.5 & 6.14. Grimmett, G. and Stirzaker, D. (2001). Probability and Random Processes. Oxford University Press.
  • Roberts, G. O., & Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probability surveys, 1, 20-71.